A Bayesian Perspective on Training Speed and Model Selection

Neural Information Processing Systems

We take a Bayesian perspective to illustrate a connection between training speed and the marginal likelihood in linear models. This provides two major insights: first, that a measure of a model's training speed can be used to estimate its marginal likelihood. Second, that this measure, under certain conditions, predicts the relative weighting of models in linear model combinations trained to minimize a regression loss. We verify our results in model selection tasks for linear models and for the infinite-width limit of deep neural networks. We further provide encouraging empirical evidence that the intuition developed in these settings also holds for deep neural networks trained with stochastic gradient descent. Our results suggest a promising new direction towards explaining why neural networks trained with stochastic gradient descent are biased towards functions that generalize well.
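The link between training speed and marginal likelihood rests on the chain rule of probability, log p(D) = Σ_i log p(d_i | d_1, …, d_{i-1}), which writes the log evidence as a sum of one-step-ahead predictive log-losses accumulated while fitting the data in sequence. A minimal sketch for a Bayesian linear model (dimensions, noise level, and prior scale below are illustrative choices, not values from the paper):

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)
n, d, alpha, sigma2 = 20, 3, 1.0, 0.1  # illustrative sizes and hyperparameters

X = rng.standard_normal((n, d))
w_true = rng.standard_normal(d)
y = X @ w_true + np.sqrt(sigma2) * rng.standard_normal(n)

# Under prior w ~ N(0, alpha I) and noise N(0, sigma2), the marginal of y is
# Gaussian with covariance K = alpha * X X^T + sigma2 * I.
K = alpha * X @ X.T + sigma2 * np.eye(n)

# Direct log marginal likelihood.
log_ml_direct = multivariate_normal(mean=np.zeros(n), cov=K).logpdf(y)

# Chain-rule decomposition: log p(y) = sum_i log p(y_i | y_{<i}),
# each conditional obtained by Gaussian conditioning on the points seen so far.
log_ml_seq = 0.0
for i in range(n):
    if i == 0:
        mu, var = 0.0, K[0, 0]
    else:
        sol = np.linalg.solve(K[:i, :i], K[:i, i])
        mu = sol @ y[:i]
        var = K[i, i] - sol @ K[:i, i]
    log_ml_seq += norm(loc=mu, scale=np.sqrt(var)).logpdf(y[i])

print(np.isclose(log_ml_direct, log_ml_seq))  # the two computations agree
```

Each conditional term is the model's predictive loss on the next data point given what it has already absorbed, which is why a model that "trains fast" (accrues low sequential prediction loss) also has high marginal likelihood.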


Review for NeurIPS paper: A Bayesian Perspective on Training Speed and Model Selection

Neural Information Processing Systems

Weaknesses: At Eq. 5, the authors introduce two sampling-based estimators of the lower bound (LB). I am not sure why the authors introduced both as estimators for the LB: the second estimator is an unbiased estimator of the (log) marginal likelihood (ML). Though it could technically be considered a biased estimator of the LB, I do not see why it should be introduced as such, since it is an unbiased estimator of the exact quantity the authors are hoping to approximate. Indeed, in the following sentence the authors write that the second estimator's bias decreases as J is increased, which is very much expected, if not almost trivial, given the point above. Another point is that when J = 1 the two estimators are algebraically the same, so the first one also becomes a (noisy) unbiased estimator of the ML.
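The relationship between the two estimators the reviewer describes can be illustrated with a minimal sketch (the likelihood samples below are hypothetical placeholders, not the paper's actual model): averaging log-likelihoods lower-bounds the log of the averaged likelihoods by Jensen's inequality, and the two coincide algebraically when J = 1.

```python
import numpy as np

rng = np.random.default_rng(1)
J = 64
# Hypothetical per-sample likelihood values p(D | theta_j), theta_j drawn
# from the prior; log-normal here purely for illustration.
likelihoods = np.exp(rng.normal(loc=-2.0, scale=0.5, size=J))

# Estimator 1: average of logs -- a lower bound, by Jensen's inequality.
est_lb = np.mean(np.log(likelihoods))
# Estimator 2: log of the average -- the log of an unbiased ML estimate.
est_ml = np.log(np.mean(likelihoods))

assert est_lb <= est_ml  # Jensen's inequality: E[log x] <= log E[x]

# With J = 1 the two estimators are algebraically identical.
single = likelihoods[:1]
assert np.isclose(np.log(single).mean(), np.log(single.mean()))
```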


Review for NeurIPS paper: A Bayesian Perspective on Training Speed and Model Selection

Neural Information Processing Systems

The work considers SGD training for Bayesian linear models and illustrates a connection between training speed and generalization, and why SGD tends to select simpler models. In particular, the work shows that a particular type of posterior sampling from gradient descent yields the same model rankings as those based on the true posterior, under suitable assumptions. Experiments on deep nets are also presented. The reviewers liked the work overall, but felt that some aspects of the exposition were unclear, that the transition to and implications for deep nets are not quite convincing, especially since there is now a better understanding of both optimization and generalization in deep nets, and that baseline comparisons (e.g., SGLD, L2 regularization, dropout) would strengthen the work.


Generalization Through the Lens of Learning Dynamics

Lyle, Clare

arXiv.org Artificial Intelligence

A machine learning (ML) system must learn not only to match the output of a target function on a training set, but also to generalize to novel situations in order to yield accurate predictions at deployment. In most practical applications, the user cannot exhaustively enumerate every possible input to the model; strong generalization performance is therefore crucial to the development of ML systems which are performant and reliable enough to be deployed in the real world. While generalization is well-understood theoretically in a number of hypothesis classes, the impressive generalization performance of deep neural networks has stymied theoreticians. In deep reinforcement learning (RL), our understanding of generalization is further complicated by the conflict between generalization and stability in widely-used RL algorithms. This thesis will provide insight into generalization by studying the learning dynamics of deep neural networks in both supervised and reinforcement learning tasks.